#### RISC vs CISC

- Reduced Instruction Set Computer <u>vs</u> Complex Instruction Set Computers
- for a given benchmark the performance of a particular computer:

$$P = \frac{1}{I*C*\frac{1}{S}}$$
where  $P = time \ to \ execute$ 

$$I = number \ of \ instructions \ executed$$

$$C = clock \ cycles \ per \ instruction$$

$$S = clock \ speed$$

- RISC approach attempts to reduce C
- CISC approach attempts to reduce I
- assuming identical clock speeds:

a RISC will execute more instructions for a given benchmark than a CISC [≈10..30%]

#### RISC-I

- history
- RISC-1 designed by MSc students under the direction of David Patterson and Carlo H.
   Séquin at UCLA Berkeley
- released in 1982
- first RISC now accepted to be the IBM 801 [1980], but design not made public at the time
- John Cocke later won both the Turing award and the Presidential Medal of Science for his work on the 801
- RISC-1 similar to SPARC [Sun, Oracle] and DLX/MIPS [discussing its pipeline later]
- http://www.eecs.berkeley.edu/Pubs/TechRpts/1982/CSD-82-106.pdf

# RISC-I Design Criteria

For an effective single chip solution artificially placed the following design constraints:

- execute one instruction per cycle [instructions must be simple to be executed in one clock cycle]
- make all instructions the same size [simplifies instruction decoding]
- access main memory with load and store instructions [load/store architecture]
- ONLY one addressing mode [indexed]
- limited support for high level languages [which means C and hence Unix]
   procedure calling, local variables, constants, ...

#### RISC-I architecture

- 32 x 32 bit registers r0 .. r31 [R0 always 0]
- PC and PSW [status word]
- 31 different instructions [all 32 bits wide]
- instruction formats



NB: 13 + 19 = 32

#### RISC-I architecture...

- opcode 128 possible opcodes
- scc if set, instruction updates the condition codes in PSW
- dst specifies one of 32 registers r0..r31
- src1 specifies one of 32 registers r0..r31
- imm, src2 if (imm == 0) then 5 lower order bits of src2 specifies one of the 32 registers r0..r31
  - if (imm == 1) then src2 is a <u>sign extended</u> 13 bit constant
- Y 19 bit constant/offset used primarily by relative jumps and Idhi [load high immediate]

#### RISC-I Arithmetic Instructions

12 arithmetic instructions which take the form

 $R_{dst} = R_{src1}$  op  $S_2$  NB: 3 address

NB: S<sub>2</sub> specifies a register or an immediate constant

operations

add, add with carry, subtract, subtract with carry, reverse subtract, reverse subtract with carry

and, or, xor

sll, srl, sra [shifts register by  $S_2$  bits where  $S_2$  can be (i) an immediate constant or (ii) a value in a register]

NB: NO mov, cmp, ...



# Synthesis of some IA32 instructions...

• loading constants  $-2^{12} < N < 2^{12}-1$ 

[constant fits into src2 field]

 $\quad \text{mov} \quad \mathsf{R_{n}}\text{, N} \, \rightarrow \quad \text{ add } \; \mathsf{R_{0}}\text{, \#N, R_{n}}$ 

• loading constants  $(N < -2^{12}) \mid \mid (N > 2^{12}-1)$ 

[constant too large for src2 field]

construct large constants using two instructions



\* may not be correct

#### Load and Store Instructions

5 load and 3 store instructions

| ldl  | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $R_{dst} = [R_{src1} + S_2]$ | load 32 long [32 bits]                |
|------|-------------------------------------------------------|------------------------------|---------------------------------------|
| ldsu | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $R_{dst} = [R_{src1} + S^2]$ | load short unsigned [16 bits]         |
| ldss | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $R_{dst} = [R_{src1} + S_2]$ | load short signed [16 bits]           |
| ldbu | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $R_{dst} = [R_{src1} + S_2]$ | load byte unsigned                    |
| ldbs | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $R_{dst} = [R_{src1} + S_2]$ | load byte signed                      |
| stl  | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $[R_{src1} + S_2] = R_{dst}$ | store long                            |
| sts  | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $[R_{src1} + S_2] = R_{dst}$ | store short [low 16 bits of register] |
| stb  | (R <sub>src1</sub> )S <sub>2</sub> , R <sub>dst</sub> | $[R_{src1} + S_2] = R_{dst}$ | store byte [low 8 bits of register]   |

- load unsigned clears most significant bits of register
- load signed extends sign across most significant bits of register
- indexed addressing [R<sub>src1</sub> + S<sub>2</sub>]
- S<sub>2</sub> must be a constant [can also be a register in RISC II]

# Synthesis of IA32 addressing modes

- register  $\rightarrow$  add  $R_0$ ,  $R_m$ ,  $R_n$
- immediate  $\rightarrow$  add R<sub>0</sub>, #N, R<sub>n</sub>
- indexed  $\rightarrow$  Idl (R<sub>src1</sub>)S<sub>2</sub>, R<sub>dst</sub>

$$R_{dst} = [R_{src1} + S_2]$$

• absolute/direct  $\rightarrow$  IdI (R<sub>0</sub>)S<sub>2</sub>, R<sub>dst</sub>

$$R_{dst} = [S_2]$$

- since S<sub>2</sub> is a 13 bit signed constant this addressing mode is very limited
- can ONLY access the top and bottom 4K (2<sup>12</sup>) of the address space!

### **RISC-I Register Windows**

- single cycle function call and return?
- need to consider parameter passing, allocation of local variables, saving of registers etc.
- "since the RISC-I microprocessor core is so simple, there's plenty of chip area left for multiple register sets"
- each function call allocates a new "window" of registers from a circular on-chip register file
- scheme based on the notion that the registers in a register window are used for specific purposes





# RISC-I Register Windows Organisation

- example shows function A calling function B
- CWP [current window pointer] points to current register window in circular on-chip register file
- on a function call CWP moved so that a new window of registers r10..r25 [16 registers] allocated from the register file
  - r10..r15 of the calling function are now mapped onto r26..r31 of the called function [used to pass parameters]



#### RISC-I Function Call and Return

the CALL and CALLR instructions take the form

[NB: SPARC always uses r15 for the return address]

#### RISC-I Procedure Call and Return...

the RET instruction takes the form

```
RET (R_{dst})S_2

PC \leftarrow R_{dst} + S_2 ; return address + constant offset

CWP \leftarrow CWP + 1 ; previous register window
```

- CALL and RET must use the same register for R<sub>dst</sub>
- in most cases, functions can be called in a "single cycle"
  - parameters store directly in r10..r15
  - no need to save registers as a new register window allocated
  - use new registers for local variables

# Register File Overflow/Underflow

what happens if functions nest too deeply and CPU runs out of register windows?



FIGURE 8.32 Change in procedure nesting depth over time. The boxes show procedure calls and returns inside the buffer before a window overflow or underflow. The program starts with three calls, a return, a call, a return, three calls, and then a window overflow.

need a mechanism to handle register file overflow and underflow

[Hennessy and Patterson]

# Register File Overflow/Underflow...



FIGURE 8.33 Number of banks or windows of registers versus overflow rate for several programs in C, LISP, and Smalltalk. The programs measured for C include a C compiler, a Pascal interpreter, troff, a sort program, and a few UNIX utilities [Halbert and Kessler 1980]. The LISP measurements include a circuit simulator, a theorem prover, and several small LISP benchmarks [Taylor et al. 1986]. The Smalltalk programs come from the Smalltalk macro benchmarks [McCall 1983] which include a compiler, browser, and decompiler [Blakken 1983 and Ungar 1987].

[Hennessy and Patterson]

# Register File Overflow/Underflow...

- can run out of register windows if functions nest deep enough [overflow]
- register window overflow can ONLY occur on a CALL/CALLR
  - need to save [spill] oldest register window onto a stack in main memory
- register window underflow can ONLY occur on a RET
  - there <u>must</u> always be at least two valid register windows in register file [window
     CWP contains registers r10..r25 and window CWP-1 contains r26..r31]
  - need to restore register window from stack

# Register File Overflow

- typical register file overflow sequence
- SWP = save window pointer [points to oldest register window in register file]
- SWP++ performed using modulo arithmetic as register file is circular
- r1 used as a stack pointer



- 1. function calls already 8 deep [register windows 0 to 7]
- 2. CWP -> register window 7, SWP -> register window 2 [oldest window]
- 3. two register windows already pushed onto stack [register windows 0 and 1]
- 4. another call will result in a register file overflow
- 5. register window 2 pushed onto stack [pointed to by SWP]
- 6. CWP and SWP move down one window [CWP++ and SWP++]

# Register File Overflow

PSW contains a CWP and SWP



before a CALL/CALLR instruction is executed, the following test is made

```
if (CWP + 1 == SWP)
TRAP(register file overflow)
```

- the trap handler must save the registers pointed to by SWP onto a stack in main memory
- How might this be done?? (i) instruction to switch to SWP window so that r10..r25 can be saved on stack using standard instructions (ii) an instruction to increment SWP and (iii) an instruction to move back to the CWP window so the CALL can be re-executed without raising a trap

# Register File Underflow

- always need 2 valid register windows in register file [SWP -> oldest valid register window]
- SWP window contains CWP's r26..31
- following test made when RET executed

- pop data from stack to restore window SWP-1
- CWP and SWP move up one window [CWP-- and SWP--]



# Problems with Multiple Register Sets?

- must save/restore 16 registers on an overflow/underflow even though only a few may be in use
- saving multiple register sets on a context switch [between threads and processes]
- referencing variables held in registers <u>by address</u> [a register does NOT normally have an address]

```
 p(\text{int i, int *j}) \\ \{ \\ *j = \dots \\ \}    p(\text{int i, int &j}) \\ \{ \\ j = \dots \\ \}    p(\text{int i, int &j}) \\ \{ \\ int i, j; \\ (\text{can j be allocated to a register as it is passed to p by address?} \\ (\text{int i, j; } // \text{can j be allocated to a register as it is passed to p by address?} \\ (\text{int i, j; } // \text{i in r16 and j in r17?} \\ p(i, \&j); \\ p(i, \&j); // \text{pass i by value and j by address?}    p(i, j) // \text{j passed by address}
```

# **Proposed Solution**

- solution proposed in original Computer IEEE paper
- "RISC-I solves that problem by giving addresses to the window registers. By reserving
  a portion of the address space, we can determine, with one comparison, whether a
  register address points to a CPU register or to one that has overflowed into memory.
  Because the only instructions accessing memory (load & store) already take an extra
  cycle, we can add this feature without reducing their performance."
- NOT implemented in RISC-I

### Proposed Solution..

- register file can be thought of as sitting on the top of stack in memory
- can then assign a notional address to each register in register file [where it would saved on stack if an overflow occurred]
- inside Q, j can be accessed as a register
- address of j passed to p(), compiler able to generation instructions to calculate its address relative to r1 [could add an instruction to instruction set specifically for this purpose] of where the register would be stored if spilled onto the stack
- \*j in p() will be mapped by load and store instructions onto a register if the address "maps" to the register file otherwise memory will be accessed



# RISC-I Pipeline

- two stage pipeline fetch unit and execute unit
- normal instructions

| fetch i1 | execute i1 |            |            |
|----------|------------|------------|------------|
|          | fetch i2   | execute i2 |            |
|          |            | fetch i3   | execute i3 |

load/store instructions

| fetch load | compute addr<br>fetch i2 | mem access<br>stall |                       |
|------------|--------------------------|---------------------|-----------------------|
|            |                          |                     | execute 2<br>fetch i3 |

- pipeline stall arises because it is NOT possible to access memory twice in the same clock cycle [fetch the next instruction and read/write target of load/store]
  - load/store 2 cycles [latency 3 cycles]
  - others 1 cycle [latency 2 cycles]

# **Delayed Jumps**

 RISC-I cycle long enough to (1) read registers, perform ALU operation and store result back in a register OR (2) read instruction from memory, BUT not both sequentially



- jmp/call/ret instructions are problematic since it is NOT possible [during one clock cycle] to calculate the destination address and ALSO fetch the destination instruction
- RISC-I solution is to use "delayed jumps"

# Delayed Jumps...

 jmp/call/ret effectively take place AFTER the following instruction [in the code] is executed

```
1 sub r16, #1, r16 {C} ;
2 jne L ; conditional jmp
3 xor r0, r0, r16 ;
4 sub r17, #1, r17 ;

10 L: sll r16, 2, r16 ;
```

if conditional jmp taken

effective execution order 1, 3, 2, 10, ...

if conditional jmp NOT taken

effective execution order 1, 3, 2, 4, ...

NB: jmp condition evaluated at the *normal time* [condition codes set by instruction 1 in this case]

# Delayed Jump Example

consider the RISC-I code for the following code segment

```
i = 0;  // assume i in r16
while (i<j)  // assume j in r17
  i += f(i);  // assume parameter and result in r10
k = 0;  // assume k in r18</pre>
```

# Delayed Jump Example...

- unoptimised
- place nop [xor r0, r0, r0] after each jmp/call/ret [in the delay slot]

```
add
                r0, r0, r16 // i = 0
                                                       what about this nop?
                 r16, r17, r0 {C} // i < j?
L0:
       sub
       jge
                 L1
                                                       instruction at L1?
                                    // nop
                r0, r0, r0
       xor
                                    // set up parameter in r10
       add
                 r0, r16, r10
                 <del>r0, r0, r0</del>
                                    // nop
                 r10, r16, r16
       add
                 LO
                 r0, r0, r0
       add
                 r0, r0, r18
L1:
```

# Delayed Jump Example

reorganised and optimized

```
add
              r0, r0, r16 // i = 0
              r16, r17, r0 {C} // i < j?
L0:
     sub
              L1
     ige
              r0, r0, r18 // k can be zeroed many times as...
      add
     callr
                        // operation idempotent
              r0, r16, r10 // set up parameter in r10
      add
              LO
     jmp
              r10, r16, r16 //i = i + f(i)
      add
L1:
```

- managed to place useful instructions in each delay slot
- setting up parameter in instruction after call to f() appears strange at first



- destination of jmp instruction is i4 [if jump NOT taken this will be the instruction after the delay slot]
- i3 executed in the delay slot
- better to execute an instruction in the delay slot than leaving execution unit idle
- since the instruction in the delay slot is fetched anyway, might as well execute it
- 60% of delay slots can be filled with useful instructions [Hennessy & Patterson]

#### What about??

```
i0 ....
jmp L1 // unconditional jump
jmp L2 // unconditional jump

L1: i10 ....
i11 ....

L2: i20 ....
i21 ....
```

best approach is to draw a pipeline diagram

#### What about?...



order i0, i10, i20, i21...

# **Pipelining**

- key implementation technique for speeding up CPUs [see Hennessy & Patterson]
- break each instruction into a series of small steps and execute them in parallel [steps from different instructions]
  - think of a car assembly line!
- clock rate set by the time needed for the longest step ideally time for each step should be equal
- consider a 5 stage instruction pipeline for the hypothetical DLX microprocessor [after Knuth's MIX]
  - IF instruction fetch
  - ID instruction decode and register fetch [operands]
  - EX execution and effective address calculation
  - MA memory access
  - WB write back [into a register]

# Pipelining...

| i   | IF | ID | EX | MA | WB |    |    |    |    |
|-----|----|----|----|----|----|----|----|----|----|
| i+1 |    | IF | ID | EX | MA | WB |    |    |    |
| i+2 |    |    | IF | ID | EX | MA | WB |    |    |
| i+3 |    |    |    | IF | ID | EX | MA | WB |    |
| i+4 |    |    |    |    | IF | ID | EX | MA | WB |

- execution time of an individual instruction remains the same...
- <u>BUT</u> throughput increased by the depth of the pipeline [5 times in this case]
- clock frequency 5 times faster than non-pipelined implementation
- good performance if pipeline runs without stalling

# Pipelining...

 for example, pipeline stalled while data is read from memory if memory access causes a cache miss

| İ   | IF | ID | EX | MA    | MA    | MA | WB |    |    |    |    |
|-----|----|----|----|-------|-------|----|----|----|----|----|----|
| i+1 |    | IF | ID | stall | stall | EX | MA | WB |    |    |    |
| i+2 |    |    | IF | IF    | IF    | ID | EX | MA | WB |    |    |
| i+3 |    |    |    |       |       | IF | ID | EX | MA | WB |    |
| i+4 |    |    |    |       |       |    | IF | ID | EX | MA | WB |

- stall normally between ID and EX phases
- instruction <u>issued</u> [from ID to EX phases] when it can be executed without stalling
  - 2 cycle cache miss penalty

# Pipelining...

- ALSO note that a non-pipelined DLX requires 2 memory access every 5 clock cycles while a pipelined DLX requires 2 memory accesses per clock cycle
  - IF: fetch instruction from memory
  - MA: read/write data from/to memory
  - helped by separate instruction and data caches internally [Harvard Architecture]

#### **Data Hazards**

consider the execution of the following instructions

$$r1 = r2 + r3$$
 [ADD]  
  $r4 = r1 - r5$  [SUB]

ADD instruction writes r1 in the WB stage, but SUB reads r1 in the ID stage



- problem solved in DLX by
  - pipeline forwarding [or bypassing] and...
  - two phase access to the register file

#### Data Hazards...

- alternative approach is to expose pipeline to programmers
- programmers would need to insert three instructions between ADD and SUB to get the expected result



# Pipeline Forwarding



- registers between each pipeline stage  $R_a$ ,  $R_b$ ,  $ALU_{out0}$ ,  $ALU_{out1}$  etc.
- all registers clocked synchronously

- the ALU results from the "previous" two instructions can be forwarded to the ALU inputs from the  $\rm ALU_{OUT0}$  &  $\rm ALU_{OUT1}$  pipeline registers before the results are written back to the register file
- tag ALU<sub>out0</sub> and ALU<sub>out1</sub> with the destination register
- EX stage checks for source register in order ALU<sub>out0</sub>, ALU<sub>out1</sub> and then A/B

# Two Phase Clocking

- DLX register file can be written then read in a single clock cycle
  - written during first half of cycle [WB phase]
  - read during second half of cycle [ID phase]
  - hence NO need for a third forwarding register [see slide 38]





- first instruction writes to r1 and the next four instructions use r1 as a source operand
- second instruction writes to r3 which is used as a source operand by the third and fourth instructions
- NB: the *intelligence* is in the EX phase, not the ID phase

#### **Load Hazards**

consider the following instruction sequence

- can't used result of load until data read from memory in MA phase
- a pipeline interlock occurs when a load hazard is detected, resulting in a pipeline stall
- loaded data must be forwarded to EX stage from ALU<sub>out1</sub>
- could remove stall by moving "&" instruction and placing it between load and add
- often possible to reschedule instructions to avoid this type of pipeline stall

## Instruction Scheduling Example

consider the following instruction sequence where a .. f are memory locations

```
a \leftarrow b + c
 d \leftarrow e - f
```

compiler generated scheduled code would be as follows

```
 \begin{array}{l} r2 \leftarrow M[b] \\ r3 \leftarrow M[c] \\ r5 \leftarrow M[e] \\ r1 \leftarrow r2 + r3 \\ r6 \leftarrow M[f] \\ M[a] \leftarrow r1 \\ r4 \leftarrow r5 - r6 \\ M[d] \leftarrow r4 \end{array} \right. ; \\ \begin{array}{l} \text{load/store swapped to avoid stall in } \underline{\text{sub}} \\ \end{array}
```

- access to many registers critical for a legal schedule
- pipeline scheduling generally increases registers usage

# **DLX Pipeline Operation**

- register transfer description
- ALU instructions

| IF | $IR \leftarrow M[PC]; PC \leftarrow PC+4$                                                     |
|----|-----------------------------------------------------------------------------------------------|
| ID | $A \leftarrow R_{SRC1}$ ; $B \leftarrow R_{SRC2}$ ; $PC1 \leftarrow PC$ ; $IR1 \leftarrow IR$ |
| EX | ALU <sub>OUT0</sub> ← result of ALU operation                                                 |
| MA | $ALU_{OUT1} \leftarrow ALU_{OUT0}$                                                            |
| WB | $R_{DST} \leftarrow ALU_{OUT1}$                                                               |

• Load/Store instructions

| IF | $IR \leftarrow M[PC]; PC \leftarrow PC+4$                                                    |
|----|----------------------------------------------------------------------------------------------|
| ID | $A \leftarrow R_{SRC1}$ ; $B \leftarrow R_{DST}$ ; $PC1 \leftarrow PC$ ; $IR1 \leftarrow IR$ |
| EX | MAR ← effective address; SMDR ← B                                                            |
| MA | $ LMDR \leftarrow M[MAR] \text{ or } M[MAR] \leftarrow SMDR $                                |
| WB | $R_{DST} \leftarrow LMDR$                                                                    |

# DLX Pipeline Operation...

• BNEZ/BEQZ instructions [conditional branch]

| IF | $IR \leftarrow M[PC]; PC \leftarrow PC+4$                                                     |
|----|-----------------------------------------------------------------------------------------------|
| ID | $A \leftarrow R_{SRC1}$ ; $B \leftarrow R_{SRC2}$ ; $PC1 \leftarrow PC$ ; $IR1 \leftarrow IR$ |
| EX | $ALU_{OUTO} \leftarrow PC1+ offset; cond \leftarrow R_{SRC1} op 0$                            |
| MA | if (cond) PC ← ALU <sub>OUT0</sub>                                                            |
| WB | idle                                                                                          |

#### **Control Hazards**

a simple DLX branch implementation results in a 3 cycle stall per branch instruction

| branch | IF |    |       |       |     |    |    |    |    |    |    |  |
|--------|----|----|-------|-------|-----|----|----|----|----|----|----|--|
| i1     |    | IF | stall | stall | *IF | ID | EX | MA | WB |    |    |  |
| i2     |    |    |       |       |     | IF | ID | EX | MA | WB |    |  |
| i3     |    |    |       |       |     |    | IF | ID | EX | MA | WB |  |

- new PC not known until the end of MA
- 3 cycle penalty whether branch is taken or NOT
- a 30% branch frequency and a 3 cycle stall results in ONLY  $\approx$  50% of the potential pipeline speed up [consider 100 instructions: non-pipelined 500; perfectly pipelined 100; 3 cycle branch stall 30 x 4 + 70 = 190]
- need to (1) determine if branch is taken or not taken earlier in pipeline and (2) compute target address earlier in pipeline

#### **DLX Branches**

- DLX doesn't have a conventional condition code register
- uses a "set conditional" instruction followed by a BEQZ or BNEZ instruction
  - sets register with 0 or 1 depending on the comparison of two source operands

```
SLT r1, r2, r3; r1 = (r2 < r3)? 1:0
BEQZ r1, L; branch to L if (r1 == 0)
```

- also SGT, SLE, SGE, SEQ and SNE [NB: also need unsigned comparisons]
- may need additional instructions compared with an instruction set where instructions implicitly set the condition codes

#### **DLX Branches...**

- DLX uses additional hardware to resolve branches during the ID stage
- test if a register == /!= 0 and adds offset to PC if condition true

| IF | $IR \leftarrow M[PC]; PC \leftarrow PC+4$           |
|----|-----------------------------------------------------|
| ID | if $(R_{SRC1} == / != 0) PC \leftarrow PC + offset$ |
| EX | idle                                                |
| MA | idle                                                |
| WB | idle                                                |

now a ONE cycle branch penalty

| branch | IF | ID~ | <b>EX</b> | MA | WB |    |    |    |    |  |
|--------|----|-----|-----------|----|----|----|----|----|----|--|
| i1     |    | IF  | IF        | ID | EX | MA | WB |    |    |  |
| i2     |    |     |           | IF | ID | EX | MA | WB |    |  |
| i3     |    |     |           |    | IF | ID | EX | MA | WB |  |

stalls pipeline until branch target known

#### **DLX Branches...**

- further improve by assuming branch NOT taken
  - pipeline stalled ONLY if branch taken
  - must undo any side effects if branch taken [minimal]



# Comparison of DLX Branch Strategies

- also compare with delayed branches [as per RISC-I and SPARC]
- assume
  - 14% branch instructions
  - 65% of branches change PC
  - 50% probability of filling branch delay slot with a useful instruction

| method            | branch penalty | effective clocks per instruction (CPI)             |
|-------------------|----------------|----------------------------------------------------|
| stall pipeline 3  | 3              | 0.86 + 0.14 x 4 = <b>1.42</b>                      |
| stall pipeline 1  | 1              | $0.86 + 0.14 \times 2 = 1.14$                      |
| predict NOT taken | 1              | $0.86 + 0.14 \times (0.35 + 2 \times 0.65) = 1.09$ |
| delayed branch    | 0.5            | 0.86 + 0.14 x 1.5= <b>1.07</b>                     |

#### **Branch Prediction**

- need to resolve branches during IF phase
- use a branch target buffer [BTB]
- during IF fetch, look for PC in branch target buffer [compare against all PCs in parallel]
- if match found, use predicted PC to fetch next instruction
- if branch <u>correctly</u> predicted, NO pipeline STALL
- during IF phase match PC against all PCs in branch target branch buffer prediction state contains the PC predicted PC PCs of [ALL] branch instructions executed output predicted PC if match found
- if branch <u>incorrectly</u> predicted, must "abort" fetched instruction and fetch correct one [pipeline stalled for one clock cycle]
- must update BTB if a new branch fetched or prediction changes

#### **Branch Prediction Flowchart**

- flowchart showing the execution of a branch instruction
- 1 cycle penalty to fetch correct instruction if incorrect instruction fetched [ie. branch incorrectly predicted]
- BTB is a cache [has a limited capacity – may need to replace entries]
- need ONLY place taken branches in BTB as following instruction will be fetched by default anyway
- NB: default predict NOT taken



#### Two Bit Branch Prediction

consider the following loop

```
add r0, #10, r1
L1: ...
...
sub r1, #1, r1
bnez r1, L1
```

- assume BTB empty and predict branch will branch the same way as it did last time
- first time bnez executed, predicted incorrectly [as default prediction branch NOT taken]
- next 8 times, bnez predicted correctly [predict taken, branch taken]
- next time, bnez predicted incorrectly [predict taken, branch NOT taken]
- now assume branch remains in BTB and loop executed again
- bnez will be predicted incorrectly [predict NOT taken, branch taken] and so on...
- 80% prediction accuracy
- a two bit scheme which changes prediction only when prediction is incorrect twice in succession gives a 90% prediction accuracy [with this example]

## **Branch Prediction Analysis**

- assume
  - 14% branch instructions
  - 90% probability of branch hitting the branch target buffer [as it has a finite capacity]
  - 90% probability of a correct prediction
  - 1 cycle penalty if branch target buffer needs updating
- branch penalty calculation
  - penalty = %hits x %mispredictions x 1 + %misses x 1
  - penalty =  $(0.9 \times 0.1 \times 1) + (0.1 \times 1)$
  - penalty = 0.19 cycle
  - branches take 1.19 cycles
  - as 14% branches results in 1.0266 effective clocks per instruction [CPI]
- compares favourably with delayed branching [1.07]

### Summary

- you are now able to:
  - outline the history of RISCs
  - outline the design criteria and architecture of the RISC-1 microprocessor
  - analyse the operation and limitations of register windows
  - analyse the operation and motivation for delayed jumps
  - develop simple RISC-1 assembly language programs
  - describe the operation of the 5 stage DLX/MIPS pipeline
  - explain how data and load hazards are resolved
  - analyse a number of different approaches for handling control hazards
  - explain how branch prediction is implemented
  - use a DLX/MIPS pipeline simulator
  - predict the number of clock cycles needed to execute DLX/MIPS code segments